Data Mining on Thrombin Dataset
Abstract
This document describes our work on task 1 of KDD Cup 2001. Several feature selection approaches and inductive learning algorithms are employed to analyze the dataset. The results are evaluated and integrated by means of ROC analysis.

1. Background

KDD Cup 2001 comprised three tasks, all of which focus on data from genomics and drug design. The first task is a classification task on a large propositionalised dataset, roughly half a gigabyte in size, known as the thrombin dataset. The second and third tasks both focus on a relational dataset. This document discusses only the first task.

Drugs are typically small organic molecules that achieve their desired activity by binding to a target site on a receptor. The first step in the discovery of a new drug is usually to identify and isolate the receptor to which it should bind, followed by testing many small molecules for their ability to bind to the target site. This leaves researchers with the task of determining what separates the active (binding) compounds from the inactive (non-binding) ones. Such a determination can then be used in the design of new compounds that not only bind, but also have all the other properties required for a drug (solubility, oral absorption, lack of side effects, appropriate duration of action, toxicity, etc.) [1].

The thrombin dataset consists of compounds tested for their ability to bind to a target site on thrombin, a key receptor in blood clotting.

2. Data Understanding

The thrombin dataset contains 1909 instances and 139352 features. Of these, 139351 are input features (predictors), each taking the value 0 or 1. The remaining one is the target feature, 'Activity', whose values are 'A' (active) and 'I' (inactive). The dataset contains 42 'A' instances and 1867 'I' instances.
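The figures above imply a heavily imbalanced, very wide dataset. The following minimal sketch works through that arithmetic using only the counts stated in the text. The function names `class_summary` and `dense_cells` are hypothetical helpers introduced for illustration and are not part of the original work.

```python
# Counts taken from the dataset description; everything else is illustrative.
N_INSTANCES = 1909
N_INPUT_FEATURES = 139351
N_ACTIVE = 42       # instances labelled 'A'
N_INACTIVE = 1867   # instances labelled 'I'


def class_summary(n_active, n_inactive):
    """Summarise the class balance of a two-class dataset (hypothetical helper)."""
    total = n_active + n_inactive
    return {
        "total": total,
        # Share of instances in the minority ('A') class.
        "active_fraction": n_active / total,
        # Accuracy obtained by always predicting the majority ('I') class.
        "majority_baseline_accuracy": n_inactive / total,
        # Number of inactive instances per active instance.
        "imbalance_ratio": n_inactive / n_active,
    }


def dense_cells(n_instances, n_features):
    """Number of 0/1 cells in a dense instance-by-feature matrix."""
    return n_instances * n_features


summary = class_summary(N_ACTIVE, N_INACTIVE)
cells = dense_cells(N_INSTANCES, N_INPUT_FEATURES)

print(f"instances: {summary['total']}")
print(f"active fraction: {summary['active_fraction']:.4f}")
print(f"majority-class baseline accuracy: {summary['majority_baseline_accuracy']:.4f}")
print(f"inactive-to-active ratio: {summary['imbalance_ratio']:.1f}")
print(f"dense matrix cells: {cells}")
```

Because a classifier that always predicts 'I' would already be right on about 98% of the instances, plain accuracy says little here. That is one plausible reading of why the results are evaluated with ROC analysis, although the text does not state the motivation explicitly.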